Abstract: We propose a layered street view model to encode
both depth and semantic information on street view images for
autonomous driving. Recently, stixels, stix-mantics, and tiered
scene labeling methods have been proposed to model street view
images. We propose a 4-layer street view model, a compact
representation over the recently proposed stix-mantics model.
Our layers encode semantic classes like ground, pedestrians,
vehicles, buildings, and sky in addition to the depths. The only
input to our algorithm is a pair of stereo images. We use a deep
neural network to extract the appearance features for semantic
classes. We use a simple and efficient inference algorithm to
jointly estimate both semantic classes and layered depth values.
Our method outperforms other competing approaches on the Daimler
Urban Segmentation dataset. Our algorithm is massively
parallelizable, allowing a GPU implementation with a processing
speed of about 9 fps.


I. Introduction

Consider a typical road scene, as shown in Figure 1, seen while
driving a car. We first observe the immediate road, nearby
obstacles (pedestrians, cars, cyclists, etc.), followed by adjacent
buildings and sky. These scene entities, or objects, can be
layered in a typical road scene based on their locations.
Understanding such a scene would require us to know the 
type of objects and spatial locations in the 3D world. Most 
conventional approaches look at this as two different problems: 
3D reconstruction and object class segmentation. Recently, 
both these problems have been merged and solved as a single 
optimization problem. Along this avenue, several challenges 
exist. Prior segmentation algorithms focus on classifying each 
pixel individually into different semantic object classes. Such 
approaches are computationally expensive and may not respect 
the layered constraint that is preserved in most road scenes. 
In this paper, we jointly infer the semantic labels and depths 
of road scenes using the layered structure of street scenes. 

In Figure 1, we use four layers to represent a street view
image. The layers are ordered from the bottom of the image
and they are associated with semantic classes. The first layer
consists of only the ground. The second layer can have
dynamic objects: vehicles, pedestrians, and cyclists. The third
layer can only have buildings. The fourth layer can only
contain sky. Each of these layers is intended to model planar
objects standing upright with respect to the ground at various
distances from the camera. The transitions between the layers
happen at places where there is depth variation. In most road
scenes, four layers are sufficient to model important object
classes along with their layered depths. Our approach can fail
in challenging scenarios with bridges or tunnels. However,
these cases can be discovered by survey vehicles in an offline
process. Autonomous vehicles can be alerted as they encounter
these regions using GPS, and additional layers can be used to
correctly interpret such challenging regions.

Fig. 1. We propose a layered interpretation algorithm for street scene
understanding. We use four layers to represent every street scene image. We
encode the semantic classes of several important object classes in the region
between these layers. These classes include ground, pedestrians, cars, buildings,
and sky. In addition to semantic classes, we also model and compute the
layer-aware depth of the scene. In other words, the depth increases for pixels as
we move to the top of the image along an image column.

We focus on obtaining layer-aware semantic labels and
depths jointly from street-view images. We use a stereo camera
setup and compute the disparity cost volume for depth cues.
For obtaining semantic cues, we use a deep neural network,
which extracts deep features from intensity images. The depth
and semantic cues are formulated in an energy function
that respects the layered street scene constraint. We propose
an inference algorithm based on dynamic programming to
efficiently minimize this energy function. Our inference
algorithm is massively parallelizable. We develop a parallel
implementation and achieve an 8.8 fps processing speed on a
GPU. Our method outperforms the competing algorithms that
do not enforce the layered constraint. Our inference algorithm
is general and can work on data from other modalities,
including LIDAR and radar sensors.

A. Related work 

Our work is related to the general area of semantic
segmentation and scene understanding, such as [?], [?], [?],
[?], [?], [?], [?], [25]. While earlier approaches were based
on hand-designed features, it has been shown recently that
using deep neural networks for feature learning leads to better
performance on this task [?], [?], [23], [?].

The problem of jointly solving both semantic segmentation
and depth estimation from a stereo camera was addressed in [20]
as a unified energy minimization framework. Our work focuses
on semantic labeling using an ordering constraint on road scenes
and using fewer classes applicable to road scenes. In [?], a
typical road scene is classified into ground, vertical objects,
and sky to estimate the geometric layout from a single image.
Objects like pedestrians and cars are segmented as vertical
objects. This would be an under-representation for road scene
understanding. The work in [?] modeled the scene using two
horizontal curves that divide the image into three regions: top,
middle, and bottom.

One popular model for road scenes is the stixel world, which
simplifies the world using a ground plane and a set of vertical
sticks on the ground representing obstacles [?]. Stixels are a
compact and efficient representation for upright objects on
the ground. The stixel representation can simply be seen as
the computation of two curves. The first curve runs on the
ground plane enclosing the free space that can be immediately
reached without collision, and the second curve encodes the
boundary of the vertical objects. In order to compute the stixel
world, either a depth map from the semi-global stereo matching
algorithm (SGM) [?] or a cost volume [?] can be used. As with SGM,
dynamic programming (DP) enables a fast implementation for
the computation of the stixels. Recently, [?] demonstrated
monocular free-space estimation using appearance cues.

Stix-mantics [25], a recently introduced model, gives more
flexibility compared to stixels. Instead of having only one
stixel for every column, they allow multiple segments along
every column in the image and also combine nearby segments
to form superpixel-style entities with better geometric
meaning. Using these stixel-inspired superpixels, semantic class
labeling is addressed.

We focus on obtaining layer-aware semantic labels and
depths jointly from street-view images. Our work is closely
related to many existing algorithms in vision, most notably
tiered scene labeling [?], joint semantic segmentation
and depth estimation [20], stixels, and more recently,
stix-mantics [25]. Our approach achieves real-time processing
speed and outperforms the competing algorithms [?] in
accuracy. We also achieve this performance without using
explicit depth estimation and temporal constraints, which can
be obtained using visual odometry. Similar to the layered street
view constraint, Manhattan constraints have been useful in
indoor scene understanding [?], [?].

The paper is organized as follows. In Section II, we present
the problem formulation. Section III discusses the extraction
of depth and appearance features. The inference algorithm
and its implementation are described in Sections IV and V.
Experiments are presented in Section VI.


II. Layered Street View 

Our goal is to jointly estimate semantic labels and depth for
each pixel in the street view image using both appearance and
depth information. We adopt a layered image interpretation.
An image is horizontally divided into four layers of different
semantic and depth regions. The layers are ordered from the
bottom to the top. In each image column, pixels belonging to
the same layer have the same semantic label and depth. The
only exception is that the depth of the pixels in the ground layer
varies according to their vertical image coordinate, which is
determined by the ground plane. The ground plane can either
be obtained in an offline external calibration process or in
an online estimation process such as using the v-disparity
map [?]. We enforce a depth-order constraint, i.e., the depth
of a lower layer is always smaller than the depth of a higher layer
in each image column.

Fig. 2. Illustration of the layered street view problem formulation: Each
column of the image is divided into up to four horizontal layers. The four layers
are ordered from the bottom of the image. The model is compact and effective
in representing a wide variety of scenarios in a typical road scene.

In our four-layer model, the first layer can only have ground.
The second layer can have pedestrians or vehicles. The third
layer can have only buildings. The fourth layer can only have
sky. Note that we do not enforce that each image column
has exactly four layers. A column can have any number of
layers from one to four. If a layer is absent in a particular
image column, the bottom of the layer above it and the top of
the layer below it are adjacent. This implies that the
curves defining these layers need not be smooth or
continuous. The four-layer model provides a flexible method for
enforcing geometry and semantics on the scene. The only
assumption required is the planar world assumption, which
is not restrictive for many applications requiring street scene
understanding. If necessary, the geometry of the layered street
scene model can be further refined into a dense depth map
at additional computational cost. Similarly, the model can
be extended with more layers and semantic classes.

A. Problem Formulation 

Notations: We use $\mathcal{W} = \{1, 2, \ldots, W\}$ and
$\mathcal{H} = \{1, 2, \ldots, H\}$ to refer to the sets that hold
the horizontal $x$ and vertical $y$ coordinates, respectively. We
consider five different semantic object classes, namely ground,
vehicle, pedestrian, building, and sky. They are denoted by the
symbols $\mathbb{G}$, $\mathbb{V}$, $\mathbb{P}$, $\mathbb{B}$, and
$\mathbb{S}$, respectively. The set of the semantic class labels
is denoted by $\mathcal{L} = \{\mathbb{G}, \mathbb{V}, \mathbb{P}, \mathbb{B}, \mathbb{S}\}$.
We use $\mathcal{D}$ for the set of disparity values. The words
disparity and depth are used interchangeably in the paper for ease
of presentation. It is understood that a one-to-one conversion can
be easily obtained by using the parameters in the camera
calibration matrix. The cardinalities of the semantic label space
and the disparity values are denoted by $L = |\mathcal{L}|$ and
$D = |\mathcal{D}|$, respectively.

Fig. 3. Layered street view algorithm: The proposed algorithm utilizes appearance features from a deep neural network and depth features from disparity
costs. The features are used to jointly infer the dense semantic and depth labels.

We formulate the layered street view problem as a
constrained energy minimization problem. The constraints encode
the order of the semantic object class labels and depth values
in each column. They limit the solution space of the variables
associated with each image column. We solve the constrained
energy minimization problem efficiently using an inference
algorithm based on dynamic programming.

We use the variables $h_{i1}$, $h_{i2}$, $h_{i3}$, and $h_{i4}$ to
denote the $y$ coordinates of the top pixels of the four layers in
image column $i$. Let $l_{i1}$, $l_{i2}$, $l_{i3}$, and $l_{i4}$ be
the semantic object class labels of the four layers, and let
$d_{i1}$, $d_{i2}$, $d_{i3}$, and $d_{i4}$ be the depths of the four
layers in image column $i$. The ordering constraint and the
knowledge of the ground plane allow us to fix some parameters. The
actual number of unknowns is only 5, given by
$\mathbf{x}_i = [h_{i1}, h_{i2}, h_{i3}, l_{i2}, d_{i3}]$. Hence the
label assignment for the entire image is given by
$\mathcal{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_W]$.
The number of possible assignments for an image column is
on the order of $O(H^3 L D)$ since $h_{i1}, h_{i2}, h_{i3} \in \mathcal{H}$,
$l_{i2} \in \mathcal{L}$, and $d_{i3} \in \mathcal{D}$.
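To make the column parameterization concrete, the following is a minimal Python sketch that expands the layer tops of one column into per-pixel labels. The helper name `expand_column` and the convention that a layer is absent when its top coincides with the top of the layer below it are assumptions of this sketch, not part of the paper.

```python
# Hypothetical helper: expand one column's layer tops into per-pixel labels.
# Rows are 1-indexed from the top of the image, as in the text.

def expand_column(H, h1, h2, h3, l2):
    """Return a list of H labels for rows 1..H.

    h1, h2, h3 are the top rows of the ground, dynamic-object and building
    layers; the sky layer always starts at row h4 = 1. A layer is treated
    as absent when its top equals the top of the layer below it
    (an assumption of this sketch).
    """
    h4 = 1
    if not (h4 <= h3 <= h2 <= h1 <= H + 1):
        raise ValueError("layer tops violate the ordering constraint")
    if l2 not in ("V", "P"):
        raise ValueError("the second layer must be a vehicle or a pedestrian")
    labels = []
    for y in range(1, H + 1):
        if y >= h1:
            labels.append("G")   # ground layer
        elif y >= h2:
            labels.append(l2)    # dynamic-object layer
        elif y >= h3:
            labels.append("B")   # building layer
        else:
            labels.append("S")   # sky layer
    return labels


column = expand_column(H=8, h1=7, h2=5, h3=3, l2="V")
print(column)
```

Only three row indices and one label are needed to describe the semantic content of a whole column, which is what keeps the per-column search space small.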

To rank the likelihood of a label assignment, we use
evidence from image appearance features and stereo disparity
matching features. We aggregate the evidence from all the pixels
in a column. Let $U_i^A(\mathbf{x}_i)$ and $U_i^D(\mathbf{x}_i)$
be the data terms representing the semantic and depth label
costs, respectively, incurred by assigning $\mathbf{x}_i$ to image
column $i$. The two terms are summed to yield the data term

$$U_i(\mathbf{x}_i) = U_i^A(\mathbf{x}_i) + U_i^D(\mathbf{x}_i), \tag{1}$$

denoting the cost of assigning $\mathbf{x}_i$ to image column $i$.
Instead of working on the standard 2D Markov Random Field
space where each pixel can have a depth value and a semantic
label as independent variables, we reduce the problem to a
constrained energy minimization problem given by

$$\min_{\mathcal{X}} \sum_{i=1}^{W} U_i(\mathbf{x}_i) \tag{2}$$

$$\text{s.t.} \quad h_{i1} > h_{i2} > h_{i3} > h_{i4} = 1, \tag{3}$$

$$d_{i1} < d_{i2} < d_{i3} < d_{i4}, \tag{4}$$

$$l_{i1} = \mathbb{G}, \quad l_{i2} \in \{\mathbb{V}, \mathbb{P}\}, \quad l_{i3} = \mathbb{B}, \quad l_{i4} = \mathbb{S}, \quad \forall i, \tag{5}$$


where constraint (3) gives the layer structure, constraint (4)
enforces the depth order, and constraint (5) takes
into account the possible semantic labels for each layer. The
variable $d_{i2}$ is a function of $h_{i1}$, the top pixel location of
the ground layer, because we assume the dynamic object is
standing upright on the ground surface at the $h_{i1}$th row of
the image. The energy function in Equation (2) models the
relation of pixels in the same column but not pixels in the
same row. We use image patches centered on a pixel as
the receptive fields for the feature computation at the pixel
location. Neighboring pixels have similar receptive fields and
thus have similar features.

The data term $U_i(\mathbf{x}_i)$ is the cost of assigning label
$\mathbf{x}_i$ to column $i$. It is the sum of the pixel-wise data
terms given by

$$
\begin{aligned}
U_i(\mathbf{x}_i) ={} & \sum_{y=h_{i4}}^{h_{i3}-1} \left( E^A(i, y, \mathbb{S}) + E^D(i, y, \infty) \right) \\
& + \sum_{y=h_{i3}}^{h_{i2}-1} \left( E^A(i, y, \mathbb{B}) + E^D(i, y, d_{i3}) \right) \\
& + \sum_{y=h_{i2}}^{h_{i1}-1} \left( E^A(i, y, l_{i2}) + E^D(i, y, d_{i2}(h_{i1})) \right) \\
& + \sum_{y=h_{i1}}^{H} \left( E^A(i, y, \mathbb{G}) + E^D(i, y, d_{i1}) \right)
\end{aligned}
\tag{6}
$$

where the per pixel appearance data term $E^A(x, y, l)$ is the
cost of assigning label $l$ to the pixel $(x, y)$ and the per pixel
depth data term $E^D(x, y, d)$ is the cost of assigning depth $d$ to
the pixel $(x, y)$. We use a deep neural network for obtaining
the per pixel appearance data term $E^A(x, y, l)$ and use a
standard disparity cost for obtaining the per pixel depth data
term $E^D(x, y, d)$, as detailed in Section III. We summarize our
layered street view algorithm in Figure 3.
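The column data term of Equation (6) can be evaluated by direct summation. The sketch below is illustrative only: the function name `column_energy`, the callable interfaces `EA(y, label)` and `ED(y, depth)`, and the toy ground-plane function are assumptions made for the example, and the ground depth is looked up per row as described in Section II.

```python
def column_energy(EA, ED, ground_depth, H, h1, h2, h3, l2, d3, h4=1):
    """Evaluate the four sums of Equation (6) for one column.

    EA(y, label) and ED(y, depth) return per-pixel costs, and
    ground_depth(y) gives the ground-plane depth at row y (1-indexed).
    """
    d2 = ground_depth(h1)  # the object stands on the ground at row h1
    total = 0.0
    for y in range(h4, h3):        # sky layer, depth at infinity
        total += EA(y, "S") + ED(y, float("inf"))
    for y in range(h3, h2):        # building layer
        total += EA(y, "B") + ED(y, d3)
    for y in range(h2, h1):        # dynamic-object layer
        total += EA(y, l2) + ED(y, d2)
    for y in range(h1, H + 1):     # ground layer
        total += EA(y, "G") + ED(y, ground_depth(y))
    return total


# Toy costs: the appearance cost is 0 for the true label and 1 otherwise,
# and the depth cost is switched off.
truth = ["S", "S", "B", "B", "V", "V", "G", "G"]
EA = lambda y, label: 0.0 if truth[y - 1] == label else 1.0
ED = lambda y, depth: 0.0
ground_depth = lambda y: 100.0 / y  # made-up ground plane

print(column_energy(EA, ED, ground_depth, 8, 7, 5, 3, "V", 50.0))
print(column_energy(EA, ED, ground_depth, 8, 7, 5, 3, "P", 50.0))
```

With these toy costs the correct configuration has zero energy, and mislabeling the two object rows as pedestrian costs one unit per row.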

III. Features 

We rely on two types of features: depth and appearance.

Fig. 4. Deep neural network for extracting the appearance features: Our deep neural network contains two parts: a multi-scale convolutional neural
network and a recursive context propagation network. The multi-scale convolutional neural network contains three convolutional layers and is applied to the
Gaussian pyramid of the input image. It extracts multi-scale appearance features, which are fed into the recursive context propagation network. The recursive
context propagation network consists of three sub-networks: the semantic mapper, the semantic combiner, and the semantic de-combiner networks. It embeds
rich context information in the features and enhances their discriminative power. Note that the layers are color-coded. The same color is assigned to the layers
sharing the filter weights. A dotted line indicates that the feature map is upsampled before feeding to the succeeding layer. For further details, please refer to
Section III.

A. Depth

We use the smoothed absolute intensity difference for the
per pixel depth data term, which is commonly used in stereo
reconstruction algorithms. We first compute the pixel-wise
absolute intensity difference for each disparity value in $\mathcal{D}$,
which renders a cost volume representation. A box filter is
then applied to smooth the cost volume. The per pixel depth
data term is given by

$$E^D(x, y, d) = \frac{1}{N} \sum_{(x', y') \in P(x, y)} \left| I_L(x', y' - d) - I_R(x', y') \right| \tag{7}$$

where $I_L$ and $I_R$ refer to the intensity values of the left and
right images, $P(x, y)$ is an image patch centered at $(x, y)$, and
$N = |P(x, y)|$ denotes the cardinality of the patch, serving as
a normalization constant. The patch size is fixed to 11-by-11
in our experiments.
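As an illustration of the box-filtered absolute-difference cost, here is a minimal pure-Python sketch for a single pixel and disparity. It is not the authors' GPU code. The name `depth_cost`, the border handling (patch pixels outside either image are skipped), and the matching convention (left pixel (x, y) is compared with right pixel (x - d, y), as is common for a rectified horizontal pair) are all assumptions; the axis and sign of the shift depend on the coordinate convention of the rig.

```python
def depth_cost(left, right, x, y, d, radius=5):
    """Mean absolute intensity difference over a (2*radius+1)^2 patch.

    Images are lists of rows, indexed [y][x]. radius=5 gives the
    11-by-11 patch quoted in the text.
    """
    height, width = len(left), len(left[0])
    total, count = 0.0, 0
    for yy in range(max(0, y - radius), min(height, y + radius + 1)):
        for xx in range(max(0, x - radius), min(width, x + radius + 1)):
            if 0 <= xx - d < width:
                total += abs(left[yy][xx] - right[yy][xx - d])
                count += 1
    return total / count if count else float("inf")


# Toy pair: the left image is the right image shifted 2 pixels to the right.
height, width, shift = 8, 12, 2
right = [[(7 * x + 3 * y) % 11 for x in range(width)] for y in range(height)]
left = [[right[y][x - shift] if x >= shift else 0 for x in range(width)]
        for y in range(height)]

best = min(range(4), key=lambda d: depth_cost(left, right, 6, 4, d))
print(best)
```

Evaluating this function for every pixel and every disparity yields the cost volume; a practical implementation would share work between neighboring patches instead of recomputing each sum.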

B. Appearance 

We compute the per pixel appearance data term using a
deep neural network. Our network, shown in Figure 4,
consists of two parts: a multi-scale convolutional neural
network (MSCNN) and a recursive context propagation network
(RCPN). The MSCNN allows us to extract multi-scale
appearance cues and the RCPN allows us to extract rich contextual
information.

MSCNN: Our MSCNN is a close variant of the neural
network proposed in [?]. It has 3 convolutional layers. The
first convolutional layer has 16 filters of size 8x8 followed by
a rectified linear unit (ReLU) and 2x2 max-pooling. The
second layer has 64 filters of size 7x7 followed by
ReLU and 2x2 max-pooling. The third layer has
256 filters of size 7x7 followed by ReLU. The stacking of
the convolutional layers yields a receptive field of 47-by-47
pixels. Due to max-pooling, the resolution of the output feature
map is smaller than that of the input image. We upsample the
feature map to the resolution of the input image.

The 3-layer convolutional neural network (CNN) is applied
separately to three scales of the Gaussian pyramid of the input
image. Specifically, we downsample the input image by
three different factors (1, 2, and 4) and use the 3-layer CNN
to extract features at each scale. We use upsampling to bring
all the feature maps to the same resolution. The feature maps
from the three scales are concatenated to obtain the MSCNN
feature map. Note that the filter weights are constrained to be
the same for all the scales. The MSCNN extracts multi-scale
appearance information for each pixel, which is then passed
to the RCPN. For further details about the MSCNN, please refer
to [?].
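The 47-by-47 receptive field follows from the layer sizes listed above. The short check below assumes stride-1 convolutions and stride-2 pooling, which the text does not state explicitly; `receptive_field` is an illustrative helper name.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs applied in order."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth measured in input pixels
        jump *= stride              # spacing between adjacent outputs
    return rf


# conv 8x8, pool 2x2, conv 7x7, pool 2x2, conv 7x7
mscnn = [(8, 1), (2, 2), (7, 1), (2, 2), (7, 1)]
print(receptive_field(mscnn))

# Three scales with 256 output channels each are concatenated.
feature_dim = 256 * 3
print(feature_dim)
```

The concatenated dimension, 768, matches the size of the MSCNN feature that the semantic mapper network takes as input.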

RCPN: Our RCPN is a variant of the network proposed
in [?]. It consists of three sub-networks: the semantic mapper
network, the semantic combiner network, and the semantic
de-combiner network. We use the RCPN to embed rich context
information in the output appearance features. The semantic
mapper network is a 1-layer CNN with 128 filters of size
1x1 followed by ReLU. It maps each 768-dimensional feature
of the MSCNN feature map to a 128-dimensional semantic
feature, which is then fed into the semantic combiner network.

The semantic combiner network has three recursive layers,
and each contains 128 filters of size 4x4 followed by ReLU.
The semantic combiner fuses input features in a 4-by-4 region
of the input semantic feature map into an output semantic
feature. This process is non-overlapping; hence, each semantic
combiner layer generates an output feature map that is 16 times
smaller than the input one. Applying the 3-layer recursive
combiner network renders an output feature map that is 4096
times smaller than the output feature map of the semantic
mapper. The semantic combiner network recursively embeds
context information from image regions with larger and larger
spatial support. The output feature maps from the three layers
form a context feature pyramid, which is fed into the semantic
de-combiner network.
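The size reduction of the context pyramid can be checked with a few lines of arithmetic. The helper `combiner_pyramid` and the 256-by-512 input size are illustrative assumptions; the sketch only tracks spatial shapes and does not model the learned filters.

```python
def combiner_pyramid(height, width, levels=3, block=4):
    """Spatial shape after each non-overlapping block-by-block fusion."""
    shapes = []
    for _ in range(levels):
        height, width = height // block, width // block
        shapes.append((height, width))
    return shapes


shapes = combiner_pyramid(256, 512)
print(shapes)

# Ratio between the number of input locations and top-level locations.
top_h, top_w = shapes[-1]
print((256 * 512) // (top_h * top_w))
```

Each level covers 16 times fewer locations than the one below it, so three levels give the factor of 16^3 = 4096 quoted above.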

Similar to the semantic combiner network, the semantic
de-combiner network has three recursive layers, and each
contains 128 filters of size 1x1 followed by ReLU. It is used
to recursively distribute context information residing in the
context pyramid back to the individual pixels, from higher to
lower levels of the context pyramid. Each de-combiner layer
fuses the feature map from the previous de-combiner layer
with that from the corresponding level of the context pyramid.
Note that our RCPN implementation differs from [?] in two
major places: 1) we use square patches for context propagation
while [?] uses superpixels [22], and 2) we use pyramids to
represent the hierarchy of context information while [?] uses
superpixel trees. Our design choices allow a more efficient
implementation because we do not use superpixel
segmentation.

Training: We use grayscale images. The pixel intensity
values are scaled to the range 0 to 1 and centered by subtracting
0.5 before being fed into the deep neural network. To train
the network, we connect the output layer to a fully connected
layer having 5 neurons, corresponding to the ground, pedestrian,
vehicle, building, and sky classes. The fully connected layer is
followed by the softmax layer. The network is trained by
minimizing the cross-entropy error via stochastic gradient descent
with momentum. We use the Caffe library [?] for training.
The number of pixels in the semantic classes can be quite
different. To avoid the bias from dominant classes (ground
and building), we weight the cross-entropy loss based on the
semantic class distribution, which yields better performance in
practice.

We use the negative logarithm of the softmax scores of the
semantic classes as the per-pixel appearance data terms. Let
$f(x, y, l)$ be the softmax score of the deep neural network
at pixel location $(x, y)$, which represents the probability of
assigning the label $l$ to the pixel. The per pixel appearance
data term is given by

$$E^A(x, y, l) = -\beta \log f(x, y, l) \tag{8}$$

where $\beta$ is a parameter controlling the relative weight of the
appearance and depth data terms.
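Equation (8) can be illustrated for a single pixel as follows. The function name `appearance_costs`, the dictionary of raw network outputs, and the default `beta=1.0` are assumptions of this sketch; the paper does not report the value of the weight.

```python
import math


def appearance_costs(logits, beta=1.0):
    """Map per-class network outputs to costs -beta * log(softmax)."""
    peak = max(logits.values())                 # subtract for stability
    exps = {label: math.exp(v - peak) for label, v in logits.items()}
    norm = sum(exps.values())
    return {label: -beta * math.log(e / norm) for label, e in exps.items()}


costs = appearance_costs({"G": 2.0, "V": 0.0, "P": 0.0, "B": 0.0, "S": 0.0})
print(min(costs, key=costs.get))

uniform = appearance_costs({label: 0.0 for label in "GVPBS"})
print(uniform["G"])
```

A confident prediction produces a cost near zero for that class, while a uniform prediction over the five classes gives every class the same cost of log 5.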

IV. Efficient Inference Algorithm 

We decompose the energy minimization problem in
Equation (2) into $W$ sub-problems, where the $i$th sub-problem is
given by

$$\min_{\mathbf{x}_i} U_i(\mathbf{x}_i) \tag{9}$$

$$\text{s.t.} \quad h_{i1} > h_{i2} > h_{i3} > h_{i4} = 1, \tag{10}$$

$$d_{i1} < d_{i2} < d_{i3} < d_{i4}, \tag{11}$$

$$l_{i1} = \mathbb{G}, \quad l_{i2} \in \{\mathbb{V}, \mathbb{P}\}, \quad l_{i3} = \mathbb{B}, \quad l_{i4} = \mathbb{S}. \tag{12}$$

We solve each of the sub-problems optimally and combine
their solutions to construct the semantic labeling and depth
map of the image. For simplicity, we will drop the subscript
$i$ in the discussion below.

Each of the sub-problems can be mapped to a 1D chain
labeling problem. The chain has 4 nodes, where the first
node contains the variables $(h_1, d_1, l_1)$, the second node
contains the variables $(h_2, d_2, l_2)$, the third node contains
the variables $(h_3, d_3, l_3)$, and the fourth node contains the
variables $(h_4, d_4, l_4)$. Utilizing the recursion in the label cost
evaluation, a standard dynamic programming algorithm can
solve the inference on the 1D chain with a complexity of
$O((HDL \cdot H)^2)$, where the product $HDL$ represents the size
of the label space at each node and the second $H$ comes from
the label cost evaluation at each node. Unfortunately, this
complexity is too high for real-time applications.

We propose a variant of the dynamic programming
algorithm to reduce the complexity of solving the sub-problem
in (9) to $O(H^2 D)$ and achieve real-time performance. We
first note that some of the variables are known from our street
view setup, as discussed in the problem formulation section.
We only need to search over the values of $h_1$, $h_2$, $l_2$, $h_3$,
and $d_3$. For any combination of $h_1$, $h_2$, and $l_2$, we need to
find the best combination of $d_3$ and $h_3$. In the following, we
show that pre-computing the best combination of $d_3$ and $h_3$ for
any $h_1$, $h_2$, and $l_2$ can be achieved in $O(H^2 D)$ time using
recursion.

We first observe that the problem in (9) can then be written
as

$$
\min_{h_1, h_2, l_2} \; \sum_{y=h_2}^{h_1-1} \left( E^A(i, y, l_2) + E^D(i, y, d_2(h_1)) \right)
+ \sum_{y=h_1}^{H} \left( E^A(i, y, \mathbb{G}) + E^D(i, y, d_1) \right)
+ Q_i(h_1, h_2)
\tag{13}
$$

where $Q_i$ is an intermediate cost table given by

$$
Q_i(h_1, h_2) = \min_{h_3,\; d_3 > d_2(h_1)} \; \sum_{y=h_4}^{h_3-1} \left( E^A(i, y, \mathbb{S}) + E^D(i, y, \infty) \right)
+ \sum_{y=h_3}^{h_2-1} \left( E^A(i, y, \mathbb{B}) + E^D(i, y, d_3) \right).
\tag{14}
$$

Note that the depth of the second layer object $d_2$ is a function of
$h_1$ because $d_2$ can be uniquely determined from $h_1$ and the
ground plane equation. As a result, $Q_i$ depends on both $h_1$
and $h_2$.

By integrating $E^A(i, y, \mathbb{B}) + E^D(i, y, d_3)$ and
$E^A(i, y, \mathbb{S}) + E^D(i, y, \infty)$ along the $y$ direction,
the sum given by

$$
R_i(h_2, h_3, d_3) = \sum_{y=h_4}^{h_3-1} \left( E^A(i, y, \mathbb{S}) + E^D(i, y, \infty) \right)
+ \sum_{y=h_3}^{h_2-1} \left( E^A(i, y, \mathbb{B}) + E^D(i, y, d_3) \right)
\tag{15}
$$

for all combinations of $h_2$ and $h_3$ can be computed in $O(H^2)$
time for each $d_3$. We further note that $Q_i$ can be computed
via a recursive update rule given by

$$Q_i(h_1 + 1, h_2) = \min\left( Q_i(h_1, h_2), \; \min_{h_3} R_i(h_2, h_3, d) \right) \tag{16}$$

where $d$ is an integer satisfying $d_2(h_1) < d < d_2(h_1 + 1)$ and
is used to ensure that the depth ordering constraint between the
second and third layers is met. Intuitively, we are computing
a running-min structure along the decreasing depth of the
building layer. The recursive update rule allows us to compute
$Q_i$ for any $h_1$ and $h_2$ in $O(H^2 D)$ time. As a result, the
complexity of finding the best configuration for an image column can
be reduced to $O(H^2 L + H^2 D) = O(H^2 D)$, where $H^2 L$ is the
time required for searching combinations of $h_1$, $h_2$, and $l_2$. We
apply the 1D labeling algorithm to each image column. The
overall complexity of the labeling algorithm is $O(W H^2 D)$.
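For checking a fast implementation on small inputs, an exhaustive reference solver for one column is useful. The sketch below is not the recursive algorithm described above: it enumerates all configurations and only uses prefix sums (the "integration along y") so that each range sum costs constant time. The names `prefix_sums` and `solve_column`, the dictionary layout of the cost tables, and the convention that equal layer tops mean an absent layer are assumptions of this sketch.

```python
from itertools import accumulate

SKY_DEPTH = float("inf")


def prefix_sums(EA, ED, label, depth):
    """Cumulative per-pixel cost of a (label, depth) pair down one column."""
    costs = [a + e for a, e in zip(EA[label], ED[depth])]
    return [0.0] + list(accumulate(costs))


def solve_column(EA, ED, ground_depth, building_depths):
    """Exhaustively minimise the column energy.

    EA[label][y-1] and ED[depth][y-1] are per-pixel costs for row y;
    ground_depth[y-1] is the ground-plane depth at row y.
    """
    H = len(EA["G"])
    # The ground depth changes from row to row, so accumulate it directly.
    ground = [0.0] + list(accumulate(
        EA["G"][y] + ED[ground_depth[y]][y] for y in range(H)))
    sky = prefix_sums(EA, ED, "S", SKY_DEPTH)
    cache = {}

    def span(label, depth, a, b):
        """Cost of rows a..b-1 (1-indexed) for a (label, depth) pair."""
        key = (label, depth)
        if key not in cache:
            cache[key] = prefix_sums(EA, ED, label, depth)
        return cache[key][b - 1] - cache[key][a - 1]

    best = (float("inf"), None)
    for h1 in range(1, H + 1):
        d2 = ground_depth[h1 - 1]          # object depth from the ground plane
        ground_cost = ground[H] - ground[h1 - 1]
        for h2 in range(1, h1 + 1):
            for l2 in ("V", "P"):
                object_cost = span(l2, d2, h2, h1)
                for h3 in range(1, h2 + 1):
                    sky_cost = sky[h3 - 1] - sky[0]
                    for d3 in building_depths:
                        if d3 <= d2:       # enforce the depth ordering
                            continue
                        cost = (ground_cost + object_cost + sky_cost
                                + span("B", d3, h3, h2))
                        if cost < best[0]:
                            best = (cost, (h1, h2, h3, l2, d3))
    return best


# Toy column: appearance cost is 0 for the true label and 1 otherwise.
H = 8
truth = ["S", "S", "B", "B", "V", "V", "G", "G"]
EA = {l: [0.0 if t == l else 1.0 for t in truth] for l in "GVPBS"}
ground_depth = [8, 7, 6, 5, 4, 3, 2, 1]
ED = {d: [0.0] * H for d in ground_depth + [10, SKY_DEPTH]}
ED[3] = [0.5] * H                          # depth 3 matches poorly everywhere
best_cost, best_config = solve_column(EA, ED, ground_depth, [3, 10])
print(best_cost, best_config)
```

This enumeration has the O(H^3 L D) per-column cost mentioned in Section II, so it is only practical as a correctness check for the faster table-based procedure.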

V. Implementation 

Our algorithm is massively parallelizable and can be
implemented using CUDA, a general purpose parallel computing
platform for NVIDIA GPUs [?]. A GPU comprises a large
number of Single-Instruction-Multiple-Data (SIMD) processor
cores to allow many threads to execute common operations
concurrently on large data arrays. In our implementation of
the labeling algorithm, we exploit data-level parallelism in all
stages of computation.

1) Depth data term: We use $W \times H$ threads to compute the
disparity costs for each pixel at $(x, y)$. We implemented
the box filter using the sliding window approach. First, we
use $H$ threads to perform a 1D sliding window on each row.
As the window moves from left to right in the horizontal
direction, the new pixel value on the right is added and
the existing one on the left is subtracted. We then use
$W$ threads to carry out the same computation for each
column.

2) Appearance data term: We use the Caffe library [?]
to compute the softmax scores output by the deep neural
network. We use NVIDIA's cuDNN library to achieve
additional speed up.

3) Intermediate cost table: We first compute the integral of
$E^A(i, y, \mathbb{B}) + E^D(i, y, d_3)$ and
$E^A(i, y, \mathbb{S}) + E^D(i, y, \infty)$
over $y$ such that the two sum terms can be retrieved
in constant time for any range. This can be done in
parallel for each column $i$. We observe that for a fixed
$h_2$, computing $R_i(h_2, h_3, d_3)$ over each $i$ and $h_3$ can be
jointly parallelized. Therefore we use $h_2 \times W$ threads to
compute an intermediate table for $d_3$ and $h_3$ and use $W$
threads to find the combinations that yield the minimum
cost for each $h_1$ and $h_2$.

4) Energy minimization and labeling: Using the previously
computed $Q$, we use $W$ threads to search for the
configuration with the minimum cost in each column in parallel.
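The sliding-window update used for the box filter in step 1 can be sketched sequentially as follows. This is a CPU illustration, not the CUDA kernel; the name `sliding_window_sum` and the choice to clip the window at the borders are assumptions.

```python
def sliding_window_sum(row, radius):
    """Sum over a window of 2*radius+1 samples, clipped at the borders."""
    n = len(row)
    out = [0] * n
    window = sum(row[:min(n, radius + 1)])   # window centred on index 0
    for x in range(n):
        out[x] = window
        entering = x + radius + 1            # sample that joins on the right
        leaving = x - radius                 # sample that drops off on the left
        if entering < n:
            window += row[entering]
        if leaving >= 0:
            window -= row[leaving]
    return out


row = [1, 2, 3, 4, 5]
print(sliding_window_sum(row, 1))
```

Each output costs one addition and one subtraction regardless of the window size. Running the same pass over every row and then over every column of the row-filtered result gives the 2D box filter.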

Memory layout is an important factor for processing speed
on a GPU. By default, our image data is stored in row-major
form. The GPU implementation naturally takes advantage of
this memory layout by assigning threads to work on pixels
on the same row, which resides in memory as a contiguous
array. This allows the GPU to coalesce the memory accesses
of the threads such that the GPU memory bandwidth is
efficiently utilized. In addition, our algorithm avoids reshaping
or transposing the data in memory, which would take extra
memory and time.

We execute our algorithm on a Windows 7 desktop
computer equipped with an NVIDIA Tesla K40 GPU and an Intel
i7 processor. We set the size of one-dimensional thread blocks
to 64 and the size of two-dimensional thread blocks to
32 x 32 to facilitate efficient scheduling of the threads on the target
GPU. To avoid register spills to local memory, we minimize
local variable declarations. In our algorithm, no data needs
to be shared among threads within a block; therefore shared
memory is not used in the implementation.

VI. Experiments 

Benchmark: We evaluated our approach using the
public Daimler Urban Segmentation dataset [24]. The dataset
contains 500 stereo grayscale image pairs with pixel-wise
semantic class annotations for the left images. While the image
size in the dataset is 1024x440, only the middle region from
(24,40) to (1000,400) is fully labeled. Hence, the effective
image size is 976x360. The dataset is designed for evaluating
only the semantic labeling using stereo image pairs. There is
no ground truth for depth.

The semantic labels in the annotations include ground, sky,
building, pedestrian, vehicle, curb, bicyclist, motorcyclist, and
background clutter. However, only ground, sky,
building, pedestrian, and vehicle are considered in the evaluation
protocol. The performance metric for the semantic labeling
is based on the PASCAL VOC intersection over union (IoU)
measure [?], which is the ratio of the cardinality of the intersection
of the ground truth and estimated semantic segments over that
of their union. Let $\hat{S}$ and $S$ be the sets of pixels labeled as sky
in the computed semantic class label map and the ground truth
label map, respectively. The IoU measure of the sky class is
given by

$$\mathrm{IoU}(\text{Sky}) = \frac{|\hat{S} \cap S|}{|\hat{S} \cup S|} \tag{17}$$

The larger the IoU measure, the better the matching between
the ground truth and estimated segments and, hence, the better
the semantic labeling accuracy.
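A minimal sketch of the per-class IoU in Equation (17) is given below. The function name `iou`, the flat label lists, and the optional `ignore` marker for unlabeled ground-truth pixels are assumptions of this sketch rather than the official evaluation code.

```python
def iou(predicted, truth, label, ignore=None):
    """Intersection over union of one class over paired pixel labels."""
    intersection = union = 0
    for p, t in zip(predicted, truth):
        if t == ignore:
            continue                      # skip unlabeled ground truth
        in_pred, in_truth = p == label, t == label
        intersection += in_pred and in_truth
        union += in_pred or in_truth
    return intersection / union if union else float("nan")


predicted = list("SSSBBG")
truth = list("SSBBBG")
print(iou(predicted, truth, "S"))
print(iou(predicted, truth, "B"))
```

In this toy example the prediction over-extends the sky by one pixel, so both the sky and the building class score 2/3 while the ground class scores 1.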

Evaluation: We followed the evaluation protocol described
in [?], which used the first 300 stereo image pairs in the
dataset for training and the remaining 200 stereo image pairs
for testing. During testing, we downsampled the input images
by half in each dimension. This was necessary for our GPU
implementation. During evaluation, we upsampled the output
to the original size.

We use the left images of the stereo pairs for training our
deep neural network. During testing, the network outputs
per-semantic-class softmax scores for each pixel. We compared
our approach with several competing approaches: the
joint-optimal ALE (ALE) algorithm [20],
stix-mantics [25], Darwin pairwise [?], and
PN-RCPN [26]. The algorithms in [20] and [26] utilize superpixel
segmentations as input, which demands additional
computation resources. The stix-mantics algorithm [25] uses depth,
obtained using an FPGA chip, and temporal constraints from
adjacent stereo images. We achieve better accuracy and
comparable computational performance without using an FPGA
chip or temporal constraints.

TABLE I
Comparison of the proposed method with several competing algorithms on semantic object class labeling accuracy using
the Daimler Urban Segmentation dataset. Accuracy is reported as IoU in percent.

Class \ Method          Joint-Opt.   Stix-mantics   Darwin-pairwise   PN-RCPN   Appearance   Proposed
                        ALE [20]     [25]           [?]               [26]      features
Ground                  94.9         93.8           95.7              96.7      96.7         96.4
Vehicle                 76.0         78.8           68.7              79.4      80.7         83.3
Pedestrian              73.1         66.0           21.2              68.4      61.3         71.1
Sky                     95.5         75.4           94.2              91.4      87.6         89.5
Building                90.6         89.2           87.6              86.3      87.4         91.2
Avg (all)               86.0         80.6           73.5              84.5      82.8         86.3
Avg (dynamic)           74.5         72.4           44.9              73.8      71.0         77.2
Runtime per frame (s)   111          0.05           N/A               2.8       0.07         0.11

Fig. 5. Visualization: The figure visualizes the output computed by the proposed method. From top to bottom, we show the left images, the right images,
the ground truth semantic labeling, the estimated semantic labeling, and the depth. The black regions are the regions where the ground truth is not available.

In Table I, we compare the semantic labeling accuracy of the
competing algorithms. The results of the competing algorithms
are duplicated from the Daimler dataset website [?]. Note that
the results on the website are different from those reported
in the original paper [25] because the unlabeled pixels were
initially not excluded from the IoU computation in the original
paper [25].

Performance: Table I shows that our method
achieves the best average IoU of 86.3%, while ALE
achieves 86.0%. The stix-mantics algorithm
achieves 80.6%, which is significantly lower
than our method and ALE. In terms of the performance for
the dynamic objects (vehicles and pedestrians), the proposed
algorithm achieves an average IoU of 77.2%, outperforming ALE,
which gets 74.5%. In terms of speed, we are about three orders of
magnitude faster than ALE. We generate both depth and semantic labels
in 114.1 ms, while ALE takes 111,000 ms. The stix-mantics
algorithm is faster than our method, requiring only
50 ms. However, it uses an FPGA chip to precompute the
depth before estimating the semantic labels, whereas we jointly
compute both semantic labels and depth.

In Figure 5, we visualize the semantic labels and depth from
the proposed algorithm. Qualitatively, we find that we obtain
visually accurate semantic labels and that our depth map resembles
a piece-wise planar approximation of the 3D scene.

We observe that our appearance features alone perform
quite well. They achieve an average IoU of 82.8%. The
layered constraint allows us to avoid impossible labelings such
as having ground regions in the middle of the sky or having
vehicles in the middle of a building. By incorporating
this constraint, our performance improves to 86.3%, which
corresponds to a 20.3% error reduction. This improvement can
be qualitatively seen in a few examples in Figure 6.
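The quoted error reduction follows from the two average IoU values if error is taken as 100 minus the average IoU, which is the assumption made in this short check.

```python
appearance_only = 82.8   # average IoU in percent, appearance features alone
layered = 86.3           # average IoU in percent, with the layered constraint

# Fraction of the remaining error that the layered constraint removes.
reduction = (layered - appearance_only) / (100.0 - appearance_only)
print(f"{100 * reduction:.1f}%")
```

The gain of 3.5 points is measured against the 17.2 points of error left by the appearance features, which gives roughly 20.3%.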

Fig. 6. The advantages of the layered constraint: Top: Semantic labeling
using only the appearance features. Bottom: Semantic labeling using the
layered constraint. This constraint avoids geometrically impossible labelings,
which may occur when using only appearance features extracted by the
deep neural network.

We report the execution time required by the individual
steps in our algorithm in Table II. All the computation is
performed on the GPU. Overall, our algorithm takes 114.1 ms
to infer the semantic labels and depth. The run time could be
roughly halved by using a second GPU card for
processing the neural network computation.

TABLE II
Algorithm component execution time

Component name                    Execution time (ms)
Depth data term (cost volume)     5.2
Appearance data term (DNN)        70.0
Intermediate table Q              25.6
Inference                         13.3
Overall                           114.1
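The overall time and the frame rate can be reproduced from the per-stage timings in Table II; the dictionary keys below are shorthand labels chosen for this sketch.

```python
stages_ms = {
    "depth data term": 5.2,
    "appearance data term": 70.0,
    "intermediate table Q": 25.6,
    "inference": 13.3,
}

total_ms = sum(stages_ms.values())
fps = 1000.0 / total_ms
print(f"{total_ms:.1f} ms per frame, {fps:.1f} fps")
```

The neural network accounts for the largest share of the per-frame time, which is why moving it to a separate GPU is the suggested way to shorten the run time.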

VII. Conclusion 

We propose a novel layered street view model and develop
an efficient algorithm to jointly estimate semantic labels and
depth for street view images. We obtain this result using
appearance features, which are computed by a deep
neural network, and depth features, which are derived from
stereo disparity costs. Our algorithm outperforms the
competing methods on the Daimler Urban Segmentation dataset and
runs at 8.8 frames per second using a GPU implementation.